Store origin-code in ARCRecord header #52
bodyOffset += getTokenizedHeaderLine(in, secondLineValues);
version = ((String)secondLineValues.get(0) +
    "." + (String)secondLineValues.get(1));
origin = (String)secondLineValues.get(2);
Is it safe to assume there is always an Origin string at position 2? Or do we need to check secondLineValues.size() > 2?
I don't know if it is safe to assume it is there, but I do know that that field will frequently contain erroneous information.
I just had a look at our own ARCs and they all have InternetArchive set for this "origin". Seems webarchive-commons (and thus Heritrix) ARCWriter just hardcodes that value: https://github.com/jrwiebe/webarchive-commons/blob/master/src/main/java/org/archive/io/arc/ARCWriter.java#L272
Given how prevalent this abuse is by now, I think it is safe to say that this field has zero informational value.
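For reference, the size check asked about above could look roughly like the following. This is only a sketch, not the code in this PR: the class name `OriginGuard` and the helper `originOrNull` are hypothetical, and returning `null` for a missing origin-code is an assumption about the desired behaviour.

```java
import java.util.Arrays;
import java.util.List;

public class OriginGuard {
    /**
     * Returns the third token of the version block's second line (the
     * origin-code position), or null when the line has fewer than three
     * tokens. Hypothetical helper, not part of webarchive-commons.
     */
    public static String originOrNull(List<String> secondLineValues) {
        return secondLineValues.size() > 2 ? secondLineValues.get(2) : null;
    }

    public static void main(String[] args) {
        // A well-formed second line: version, reserved, origin-code.
        System.out.println(originOrNull(Arrays.asList("1", "0", "InternetArchive")));
        // A "mangled" second line with no origin-code still parses.
        System.out.println(originOrNull(Arrays.asList("1", "0")));
    }
}
```

With a guard of this kind, an ARC whose version block lacks the origin-code yields a missing value instead of an `IndexOutOfBoundsException`.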
|
Looks good to me, although I don't know if the origin field is always present. Maybe @kris-sigur or @johnerikhalse have more experience of ARC files? |
|
According to the ARC format grammar origin-code is required, but I have no idea if all ARC-generating tools respect this: https://archive.org/web/researcher/ArcFileFormat.php |
|
Backwards compatibility is very important when dealing with ARCs. Historically, there have been all types of "mangled" ARCs as different tools have come and gone. I don't know if that applies to this. Or to put it another way, any ARC (even one that is technically invalid according to the spec) that could be parsed before this change should still parse after this change. Which is clearly not the case. |
|
Having said all that, I think I'd accept this change as long as it checked there was a third element in that list before |
|
@anjackson Yes. That was substantially what I meant. Perhaps also adding a note to the Javadoc for the |
|
I was about to add a check as per @anjackson's suggestion and discovered this is actually already done by |
|
Excellent. So, if you add a 1.1.7 section and a note about this change to the CHANGES.md then I think we're good to go. |
It would be useful for our purposes if the origin-code from the version block of an ARC file were stored and made accessible by a method in ARCRecordMetaData. In my fork this method is called getOrigin().
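A minimal sketch of what such an accessor might look like, assuming the metadata keeps its header fields in a map. The class `OriginMetaDataSketch`, the key name `"origin"`, and the `setOrigin` method are all hypothetical stand-ins for illustration; only the name `getOrigin()` comes from the description above.

```java
import java.util.HashMap;
import java.util.Map;

public class OriginMetaDataSketch {
    // Hypothetical key under which the origin-code is stored.
    public static final String ORIGIN_FIELD_KEY = "origin";

    // Stand-in for the header-field map of a record's metadata.
    private final Map<String, Object> headerFields = new HashMap<>();

    /** Stores the origin-code; a null value (field absent) is not stored. */
    public void setOrigin(String origin) {
        if (origin != null) {
            headerFields.put(ORIGIN_FIELD_KEY, origin);
        }
    }

    /** Returns the stored origin-code, or null if the ARC did not supply one. */
    public String getOrigin() {
        return (String) headerFields.get(ORIGIN_FIELD_KEY);
    }

    public static void main(String[] args) {
        OriginMetaDataSketch meta = new OriginMetaDataSketch();
        meta.setOrigin("InternetArchive");
        System.out.println(meta.getOrigin());
    }
}
```

Callers would need to treat a `null` return as "no origin-code recorded", which keeps older or non-conforming ARCs readable.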